In the vibrant and diverse King County real estate market, including Seattle’s dynamic environment, property prices are shaped by an array of variables. The primary challenge is to construct a predictive model that can accurately estimate house prices within this area. Utilizing a comprehensive dataset that encompasses diverse house attributes, this model aims to decode the complex mechanisms influencing house pricing.
The model strives to offer precise price predictions for properties in King County by effectively correlating various house features with their market prices. This aspect is crucial in understanding and quantifying how different characteristics impact the value of a property.
A key goal of the model is to unearth and interpret the multitude of factors that play a significant role in determining house prices within the region. This venture goes beyond mere statistical analysis to provide practical, real-world insights, thereby enriching the understanding of real estate dynamics for all stakeholders.
The model is designed to be a powerful asset for a range of users, including real estate agents, prospective buyers, and sellers. By offering accurate price predictions and deep market insights, it aids in making informed and strategic decisions in the property market.
Initial data preparation is vital to ensure accuracy in the model. This stage involves cleansing the data, converting data types, and creating dummy variables for categorical features. Following this, an exploratory data analysis (EDA) is conducted to delve into the dataset’s characteristics, examining statistical summaries and relationships between variables.
The process uses statistical techniques such as stepwise regression for feature selection, together with diagnostics such as the Variance Inflation Factor (VIF) and the Anderson-Darling test to check for multicollinearity and normality, respectively. Additionally, diagnostic plots are used to detect outliers.
A range of models are employed and assessed:
Linear Models: Including Ordinary Least Squares (OLS) and Weighted Least Squares (WLS).
Regularization Techniques: Such as Ridge, Lasso, and Elastic Net to handle multicollinearity.
Robust Regression: Utilizing Huber’s method to minimize the influence of outliers.
Advanced Models: Exploring alternatives like regression trees, neural networks (NN), or support vector machines (SVM).
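To illustrate why a robust method can help when outliers are present, here is a minimal sketch comparing OLS with Huber regression. It assumes the MASS package (which provides rlm() and psi.huber) and uses a small made-up dataset, not the King County data; the names toy, ols_fit, and huber_fit are hypothetical.

```r
library(MASS)  # provides rlm() and psi.huber

set.seed(1)
x <- 1:20
y <- 1 + 2 * x + rnorm(20, sd = 1)  # synthetic data: true intercept 1, true slope 2
y[15] <- y[15] + 100                # inject a single gross outlier

toy <- data.frame(x = x, y = y)

ols_fit   <- lm(y ~ x, data = toy)                     # ordinary least squares
huber_fit <- rlm(y ~ x, data = toy, psi = psi.huber)   # Huber M-estimation

ols_slope   <- unname(coef(ols_fit)["x"])
huber_slope <- unname(coef(huber_fit)["x"])

# Compare how far each slope estimate sits from the true value of 2
print(c(ols = ols_slope, huber = huber_slope))
```

On data like this, the Huber fit down-weights the outlying observation, so its slope should stay closer to the true value than the OLS slope does.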
The model’s effectiveness is evaluated using metrics like RMSE and R-squared, across both the training (70%) and testing (30%) data sets, to ensure its reliability and applicability in real-world scenarios.
This introduction sets the stage for a comprehensive analysis, highlighting the multifaceted approach adopted in this project. From meticulous data preparation to sophisticated modeling, the endeavor is not just to predict house prices accurately but also to provide valuable insights into King County’s real estate market.
The King County house sales dataset is a comprehensive collection of 21,613 observations, each representing a unique house sale. The dataset encompasses a variety of features that describe different aspects of the houses sold. Below is a detailed description of each variable in the dataset:
| Variable | Description |
|---|---|
| id | Unique ID for each home sold (not used as a predictor) |
| date | Date of the home sale |
| price | Price of each home sold |
| bedrooms | Number of bedrooms |
| bathrooms | Number of bathrooms, “.5” accounts for a bathroom with a toilet but no shower |
| sqft_living | Square footage of the apartment interior living space |
| sqft_lot | Square footage of the land space |
| floors | Number of floors |
| waterfront | A dummy variable for whether the apartment was overlooking the waterfront or not |
| view | An index from 0 to 4 of how good the view of the property was |
| condition | An index from 1 to 5 on the condition of the apartment |
| grade | An index from 1 to 13 about building construction and design quality |
| sqft_above | The square footage of the interior housing space above ground level |
| sqft_basement | The square footage of the interior housing space below ground level |
| yr_built | The year the house was initially built |
| yr_renovated | The year of the house’s last renovation |
| zipcode | The zipcode area the house is in |
| lat | Latitude coordinate |
| long | Longitude coordinate |
| sqft_living15 | The square footage of interior housing living space for the nearest 15 neighbors |
| sqft_lot15 | The square footage of the land lots of the nearest 15 neighbors |
The dataset’s preparation involved meticulous cleaning and transformation processes to optimize it for accurate predictive analysis. Key steps undertaken include:
The id variable, representing a unique identifier for each house sale, does not contribute to predicting house prices and was therefore removed. This step is crucial in focusing the model on variables that influence the outcome (price).
lat (latitude) and long (longitude) were initially retained for their crucial role in calculating geographical distances, which could potentially influence house prices.
The date variable, initially in a string format, was transformed into a numeric format. This conversion is essential for incorporating the date into statistical models, as numeric representations are more amenable to various types of analysis.
For variables such as price, sqft_living, sqft_lot, etc., necessary conversions were performed to ensure they are in a suitable numeric format.
Categorical variables such as waterfront, view, condition, and grade were transformed into dummy variables. This transformation is pivotal for regression analysis as it allows these non-numeric variables to be effectively included in the model.
This applies to waterfront, which is a binary indicator itself, and to ordinal variables like view and condition, which have intrinsic order but need to be numerically represented for modeling.
For bathrooms, where values like “0.5” represent bathrooms with a toilet but no shower, the data was kept as is, considering these nuances convey important information about the house’s characteristics.
The zipcode variable was transformed by extracting the first three digits, which helps in reducing the number of dummy variables and preventing the model from becoming overly complex while still capturing the geographical influences on house prices.
The grade variable was clustered into broader categories to simplify the model and focus on significant differences in construction and design quality.
The haversine_distance calculation is particularly significant for understanding the spatial relationships and proximity to key locations that might affect house prices.
# Data Preprocessing and Transformation
library(dplyr)       # mutate() and the %>% pipe
library(stringr)     # str_replace_all()
library(fastDummies) # dummy_cols()

set.seed(123) # Setting a seed for reproducibility
split_index <- sample(seq_len(nrow(df)), size = floor(0.7 * nrow(df)))
train_df <- df[split_index, ]
test_df <- df[-split_index, ]

# Remove non-numeric characters from the 'price' column and convert it to numeric.
# Already-numeric prices are left untouched: converting them to text first can
# produce scientific notation (e.g. "1e+05"), which the regex would corrupt.
clean_price <- function(x) {
  if (is.numeric(x)) return(x)
  as.numeric(str_replace_all(as.character(x), "[^0-9.]", ""))
}
train_df$price <- clean_price(train_df$price)
test_df$price <- clean_price(test_df$price)

# Calculation of Convergence Point: mean location of the top 10% most expensive training homes
high_value_threshold <- quantile(train_df$price, probs = 0.90, na.rm = TRUE) # High-value threshold
high_value_homes <- train_df[train_df$price >= high_value_threshold, ]       # Select high-value homes
convergence_point <- c(mean(high_value_homes$lat, na.rm = TRUE),
                       mean(high_value_homes$long, na.rm = TRUE))            # (lat, long)

# Data Transformation Function; linear_model = TRUE drops one reference dummy per variable
transform_data <- function(df, convergence_point, linear_model) {
  # Date Transformation: Convert the 'date' column to a Date object if present
  if ("date" %in% colnames(df)) {
    df$date <- as.Date(substr(as.character(df$date), 1, 8), format = "%Y%m%d")

    # Date-Time Feature Engineering: Extract various date-related features
    df$year_sold <- lubridate::year(df$date)
    df$month_sold <- lubridate::month(df$date)
    df$day_sold <- lubridate::day(df$date)
    # Explicit levels keep the labels aligned even if a quarter is absent from a subset
    df$season <- factor(lubridate::quarter(df$date), levels = 1:4,
                        labels = c("Winter", "Spring", "Summer", "Fall"))
    df$week_of_year <- lubridate::week(df$date)
    df$day_of_year <- lubridate::yday(df$date)
  }

  # Creating Dummy Variables: Convert categorical variables into dummy variables
  df <- df %>%
    mutate(zipcode = as.factor(zipcode),
           waterfront = as.factor(waterfront),
           view = as.factor(view),
           condition = as.factor(condition),
           grade = as.character(grade)) %>%
    dummy_cols(select_columns = c('zipcode', 'view', 'condition', 'grade'))

  # Drop one reference dummy per categorical variable to avoid perfect multicollinearity
  if (linear_model) {
    df <- df[, !(names(df) %in% c("zipcode_98199", "view_0", "condition_1", "grade_13"))]
  }

  # Haversine Distance Function: distance in km between two points on Earth's surface
  haversine_distance <- function(lat1, long1, lat2, long2) {
    R <- 6371 # Earth radius in kilometers
    delta_lat <- (lat2 - lat1) * pi / 180
    delta_long <- (long2 - long1) * pi / 180
    a <- sin(delta_lat / 2)^2 +
      cos(lat1 * pi / 180) * cos(lat2 * pi / 180) * sin(delta_long / 2)^2
    c <- 2 * atan2(sqrt(a), sqrt(1 - a))
    R * c
  }

  # Calculate Haversine Distance (the function is vectorized over lat1 and long1)
  df$distance_to_convergence <- haversine_distance(df$lat, df$long,
                                                   convergence_point[1], convergence_point[2])

  # Remove columns that are no longer needed
  df <- df[, !(names(df) %in% c("id", "date", "zipcode", "view", "condition", "grade"))]
  return(df)
}

# Applying the transformation function to training and test sets
train_df_linear <- transform_data(train_df, convergence_point, linear_model = TRUE)      # Training data for linear models
test_df_linear <- transform_data(test_df, convergence_point, linear_model = TRUE)        # Test data for linear models
train_df_non_linear <- transform_data(train_df, convergence_point, linear_model = FALSE) # Training data, all dummies kept
test_df_non_linear <- transform_data(test_df, convergence_point, linear_model = FALSE)   # Test data, all dummies kept
| price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | sqft_above | sqft_basement | yr_built | yr_renovated | lat | long | sqft_living15 | sqft_lot15 | year_sold | month_sold | day_sold | season | week_of_year | day_of_year | zipcode_98001 | zipcode_98002 | zipcode_98003 | zipcode_98004 | zipcode_98005 | zipcode_98006 | zipcode_98007 | zipcode_98008 | zipcode_98010 | zipcode_98011 | zipcode_98014 | zipcode_98019 | zipcode_98022 | zipcode_98023 | zipcode_98024 | zipcode_98027 | zipcode_98028 | zipcode_98029 | zipcode_98030 | zipcode_98031 | zipcode_98032 | zipcode_98033 | zipcode_98034 | zipcode_98038 | zipcode_98039 | zipcode_98040 | zipcode_98042 | zipcode_98045 | zipcode_98052 | zipcode_98053 | zipcode_98055 | zipcode_98056 | zipcode_98058 | zipcode_98059 | zipcode_98065 | zipcode_98070 | zipcode_98072 | zipcode_98074 | zipcode_98075 | zipcode_98077 | zipcode_98092 | zipcode_98102 | zipcode_98103 | zipcode_98105 | zipcode_98106 | zipcode_98107 | zipcode_98108 | zipcode_98109 | zipcode_98112 | zipcode_98115 | zipcode_98116 | zipcode_98117 | zipcode_98118 | zipcode_98119 | zipcode_98122 | zipcode_98125 | zipcode_98126 | zipcode_98133 | zipcode_98136 | zipcode_98144 | zipcode_98146 | zipcode_98148 | zipcode_98155 | zipcode_98166 | zipcode_98168 | zipcode_98177 | zipcode_98178 | zipcode_98188 | zipcode_98198 | view_1 | view_2 | view_3 | view_4 | condition_2 | condition_3 | condition_4 | condition_5 | grade_3 | grade_4 | grade_5 | grade_6 | grade_7 | grade_8 | grade_9 | grade_10 | grade_11 | grade_12 | distance_to_convergence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 475000 | 4 | 2.50 | 2040 | 16200 | 2.0 | 0 | 2040 | 0 | 1997 | 0 | 47.7366 | -121.958 | 2530 | 15389 | 2015 | 3 | 7 | Winter | 10 | 66 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 23.544800 |
| 316000 | 4 | 1.50 | 2120 | 46173 | 2.0 | 0 | 2120 | 0 | 1974 | 0 | 47.6503 | -121.968 | 2000 | 46173 | 2015 | 5 | 8 | Spring | 19 | 128 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 19.029290 |
| 802000 | 4 | 2.25 | 2130 | 8734 | 2.0 | 0 | 2130 | 0 | 1961 | 0 | 47.5672 | -122.161 | 2550 | 8800 | 2014 | 9 | 4 | Summer | 36 | 247 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 6.897144 |
| 905000 | 4 | 2.50 | 3330 | 9557 | 2.0 | 0 | 3330 | 0 | 1995 | 0 | 47.5526 | -122.102 | 3360 | 9755 | 2015 | 3 | 25 | Winter | 12 | 84 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 11.166806 |
| 700000 | 4 | 2.25 | 2440 | 9450 | 1.5 | 0 | 2440 | 0 | 1947 | 2014 | 47.7061 | -122.307 | 1720 | 7503 | 2014 | 10 | 30 | Fall | 44 | 303 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 12.037276 |
| 178500 | 3 | 1.00 | 900 | 10511 | 1.0 | 0 | 900 | 0 | 1961 | 0 | 47.2883 | -122.272 | 1460 | 10643 | 2015 | 2 | 12 | Winter | 7 | 43 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 36.721387 |
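For reference, the distance_to_convergence column shown in the table follows the haversine formula as it is implemented in the transformation code. Written out, with $\varphi$ for latitude and $\lambda$ for longitude (both in radians), subscript 1 for the house, subscript 2 for the convergence point, and $R = 6371$ km as in the code:

```latex
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right)
     + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right) \\
d &= R\,c
\end{aligned}
```

As a quick sanity check on the units, two points one degree of latitude apart on the same meridian give $a = \sin^2(\pi/360)$, so $c = \pi/180$ and $d = 6371 \cdot \pi/180 \approx 111.19$ km.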
The exploratory data analysis (EDA) conducted on the King County house sales dataset is an in-depth exploration aimed at uncovering patterns, anomalies, and relationships within the data. This comprehensive EDA includes a variety of analyses to gain a holistic understanding of the dataset’s characteristics.
Continuous variables: price, sqft_living, sqft_lot, and others. Key aspects include examining their distributions, identifying potential outliers, and understanding their range and central tendencies.
Categorical variables: the distribution and count of bedrooms, bathrooms, floors, waterfront, view, condition, and grade are analyzed.
Correlations: how continuous variables correlate with each other and, more importantly, with the target variable price.
Time-related features: the influence of year_sold, month_sold, and season on house prices.
Spatial aspect: the distance_to_convergence variable.
The first part of the analysis focuses on continuous variables like price, sqft_living, sqft_lot, and others. Key aspects include examining their distributions, identifying potential outliers, and understanding their range and central tendencies.
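The checks described above (range, central tendency, potential outliers) can be sketched in a few lines of base R. This is only an illustration on simulated data: sqft_demo is a hypothetical right-skewed stand-in for a variable such as sqft_living, and the 1.5 × IQR fence is one common convention, not necessarily the rule used in this project.

```r
set.seed(42)
# Hypothetical right-skewed stand-in for a variable such as sqft_living
sqft_demo <- round(rlnorm(1000, meanlog = 7.5, sdlog = 0.4))

# Range and central tendency
print(summary(sqft_demo))

# Flag potential outliers with the 1.5 * IQR rule
q <- unname(quantile(sqft_demo, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
is_outlier <- sqft_demo < lower_fence | sqft_demo > upper_fence

cat("Flagged as potential outliers:", sum(is_outlier), "of", length(sqft_demo), "\n")

# A mean above the median is a simple hint of right skew
cat("Mean exceeds median:", mean(sqft_demo) > median(sqft_demo), "\n")
```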
In the scatter plot above, we compare the price of homes
against their sqft_living (square footage of interior
living space). This visualization allows us to explore the relationship
between these two variables.
The histogram above displays the distribution of
sqft_living. It reveals that the variable is right-skewed,
with most homes having smaller living spaces and relatively fewer very
large living spaces.
The scatter plot above compares price against
sqft_lot (square footage of land space). It helps us
understand if there’s any relationship between the size of the lot and
the sale price.
The histogram above visualizes the distribution of
sqft_lot. Similar to sqft_living, this
variable is right-skewed, with most homes having smaller lot sizes and
relatively fewer very large lots.
In the scatter plot above, we compare price against
sqft_above (square footage of the interior housing space
above ground level). This analysis helps us explore the impact of
above-ground living space on home prices.
The histogram above shows the distribution of
sqft_above. It suggests that most homes have similar
above-ground square footage, with relatively fewer having significantly
larger or smaller above-ground spaces.
Excluding homes that do not have a basement.
The scatter plot above compares price against
sqft_basement (square footage of the interior housing space
below ground level). This visualization helps us understand if the
presence and size of a basement influence home prices.
The histogram above visualizes the distribution of
sqft_basement. It indicates that most homes have little to
no basement space, while some have larger basement areas.
The scatter plot above compares price against the year
when homes were initially built (yr_built). This analysis
helps us understand how the age of a home relates to its sale price.
The histogram above displays the distribution of
yr_built. It provides insights into the distribution of
home ages in the dataset.
Excluding homes that did not have a documented renovation.
In the scatter plot above, we compare price against the
year of the last renovation (yr_renovated). This analysis
helps us understand whether recent renovations impact home prices.
The histogram above visualizes the distribution of
yr_renovated. It provides insights into the distribution of
renovation years in the dataset.
The scatter plot above compares price against
distance_to_convergence. This analysis helps us explore
whether the distance to a convergence point impacts home prices.
The distribution and count of categorical variables such as
bedrooms, bathrooms, floors,
waterfront, view, condition, and
grade are analyzed.
The scatter plot above compares price against the number
of bedrooms. This visualization helps us understand how the
number of bedrooms influences home prices.
The bar plot above displays the distribution of the
bedrooms variable, showing the frequency of each bedroom
count.
In the scatter plot above, we compare price against the
number of bathrooms. This analysis helps us explore the
relationship between the number of bathrooms and home prices.
The bar plot above visualizes the distribution of the
bathrooms variable, showing the frequency of each bathroom
count.
The scatter plot above compares price against the number
of floors. This analysis helps us understand how the number
of floors in a home relates to its sale price.
The bar plot above displays the distribution of the
floors variable, showing the frequency of each floor
count.
In the scatter plot above, we compare price against the
waterfront variable. This visualization helps us explore
how having a waterfront view impacts home prices.
The bar plot above visualizes the distribution of the
waterfront variable, showing the frequency of waterfront
and non-waterfront properties.
The scatter plot above compares price against the
view variable, which represents the quality of the
property’s view. This analysis helps us explore how the view quality
impacts home prices.
The bar plot above displays the distribution of the view
variable, showing the frequency of different view quality ratings.
In the scatter plot above, we compare price against the
condition variable, which represents the condition of the
property. This analysis helps us explore how property condition relates
to home prices.
The bar plot above visualizes the distribution of the
condition variable, showing the frequency of different
condition ratings.
The scatter plot above compares price against the
grade variable, which has been aggregated into categories
as per the provided header. This analysis helps us explore how the grade
of construction and design impacts home prices.
The bar plot above displays the distribution of the
grade_category variable, showing the frequency of different
grade categories.
Understanding how continuous variables correlate with each other and,
more importantly, with the target variable price.
| Variable | Correlation with Price |
|---|---|
| price | 1.0000000 |
| sqft_living | 0.7055923 |
| grade_category | 0.6713220 |
| grade_category_numeric | 0.6713220 |
| sqft_above | 0.6090981 |
| sqft_living15 | 0.5872083 |
| bathrooms | 0.5337709 |
| view_category | 0.4016872 |
| grade_11 | 0.3693088 |
| grade_10 | 0.3326084 |
| sqft_basement | 0.3278610 |
| bedrooms | 0.3141109 |
| view_4 | 0.3107036 |
| lat | 0.3066488 |
| grade_12 | 0.2927317 |
| zipcode_98004 | 0.2685431 |
| floors | 0.2593524 |
| grade_9 | 0.2326972 |
| grade_13 | 0.2198225 |
| zipcode_98039 | 0.2087856 |
| Variable1 | Variable2 | Correlation |
|---|---|---|
| day_of_year | week_of_year | 0.9996942 |
| week_of_year | day_of_year | 0.9996942 |
| day_of_year | month_sold | 0.9958286 |
| month_sold | day_of_year | 0.9958286 |
| week_of_year | month_sold | 0.9955271 |
| month_sold | week_of_year | 0.9955271 |
| season | month_sold | 0.9651301 |
| month_sold | season | 0.9651301 |
| day_of_year | season | 0.9617683 |
| season | day_of_year | 0.9617683 |
| week_of_year | season | 0.9616263 |
| season | week_of_year | 0.9616263 |
| sqft_above | sqft_living | 0.8790242 |
| sqft_living | sqft_above | 0.8790242 |
| condition_4 | condition_3 | -0.8128547 |
| condition_3 | condition_4 | -0.8128547 |
sqft_above &
sqft_living: Both variables are highly correlated
because the square footage of the living area above ground
(sqft_above) is part of the total square footage of living
space (sqft_living). We remove sqft_above as
it is likely to contain less unique information than the total living
space.
week_of_year, day_of_year,
month_sold: These variables are related to the
date the house was sold and are thus inherently correlated.
day_of_year carries the most granular information, so we
might prefer to keep it and remove week_of_year and
month_sold which provide more aggregated temporal
information.
condition_4 &
condition_3: The condition of the house is a
categorical variable that has been one-hot encoded. Since these are
mutually exclusive categories, they are negatively correlated. We might
decide to keep one category as the reference group and remove the
others, or revert to the original categorical variable to capture the
overall condition in a single variable.
By removing these variables, we aim to reduce multicollinearity, which can distort the estimated regression coefficients, inflate standard errors, and undermine the statistical significance of the predictors. The goal is to retain the variables that each contribute unique information to the model’s prediction of house prices.
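A table of highly correlated pairs can be built so that each pair appears only once by restricting the search to the upper triangle of the correlation matrix. The sketch below uses the built-in mtcars data purely as a stand-in, and the 0.8 cutoff is an assumed threshold rather than the one used in this analysis.

```r
# mtcars ships with R and is used here only as a stand-in dataset
cor_mat <- cor(mtcars)

threshold <- 0.8  # assumed cutoff for "highly correlated"

# Restrict to the upper triangle so each pair is reported once
idx <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  Variable1   = rownames(cor_mat)[idx[, "row"]],
  Variable2   = colnames(cor_mat)[idx[, "col"]],
  Correlation = cor_mat[idx],
  stringsAsFactors = FALSE
)

# Sort by absolute correlation, strongest first
high_pairs <- high_pairs[order(-abs(high_pairs$Correlation)), ]
print(high_pairs, row.names = FALSE)
```

Dropping the mirrored rows halves the table without losing information, since correlation is symmetric.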
In the table above, we’ve displayed the top 20 correlation values
with the target variable price, sorted by their absolute
values. Here are some of the key findings:
sqft_living, sqft_above, sqft_living15, and bathrooms exhibit strong positive correlations with the target variable. This suggests that as these variables increase, the house price tends to increase as well.
grade_11_13, view_4, and grade_8_10 also show positive correlations, indicating that higher-grade properties and better views tend to have higher prices.
sqft_living and grade_11_13 appear to be strong predictors of price.
Certain zip codes, such as zipcode_98004, zipcode_98039, and zipcode_98040, also have notable positive correlations, indicating the significance of location in price determination.
Analyzing the influence of time-related features such as month and season on house prices.
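One simple way to examine a time-related feature is to aggregate prices by month of sale and compare medians alongside sale counts. The sketch below runs on simulated data: the sales data frame, its dates, and its prices are all made up for illustration, and only the column name month_sold mirrors the feature engineered earlier.

```r
set.seed(7)
# Hypothetical sales records with made-up dates and prices
all_days <- seq(as.Date("2014-05-01"), as.Date("2015-05-31"), by = "day")
sales <- data.frame(
  date  = sample(all_days, 500, replace = TRUE),
  price = round(rlnorm(500, meanlog = 13, sdlog = 0.5))
)
sales$month_sold <- as.integer(format(sales$date, "%m"))

# Median price and number of sales per month
median_price <- tapply(sales$price, sales$month_sold, median)
n_sales <- table(sales$month_sold)

monthly <- data.frame(
  month_sold   = as.integer(names(median_price)),
  median_price = as.numeric(median_price),
  n_sales      = as.integer(n_sales)
)
print(monthly)
```

The median is used rather than the mean because house prices are right-skewed, so a few very expensive sales would otherwise dominate a month's figure.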
Investigating the spatial aspect by analyzing the
distance_to_convergence variable.
This detailed review of the King County house sales dataset underscores the thorough preparation undertaken for the predictive analysis. The dataset’s diverse variables, both continuous and categorical, have been meticulously processed and analyzed, providing a robust foundation for developing the predictive model. With the comprehensive EDA and graphical analysis, we gain valuable insights into the correlations and distributions within the data, setting the stage for effective model building and accurate house price prediction.
Up to this point, we have successfully conducted an exploratory data
analysis (EDA) to gain valuable insights into the dataset. We’ve
visualized key features such as price, bedrooms, bathrooms, and more,
allowing us to better understand the data’s distribution and
relationships. Additionally, we’ve explored various trends, including
monthly, seasonal, weekly, and daily trends in both house prices and the
count of homes sold. Furthermore, we have cleaned and prepared the data,
removing irrelevant variables like id, lat,
and long to streamline it for modeling. With these
preliminary steps completed, we are now ready to delve into the model
development process.
To commence the model development process, we establish an Ordinary Least Squares (OLS) regression model as our baseline. This initial model utilizes the features that have undergone transformation and cleaning during the exploratory data analysis (EDA) phase. To maintain data quality, enhance model performance, and facilitate interpretability, we begin by removing columns introduced in prior graphical iterations. Additionally, we employ the standard data preprocessing practice of dropping columns with missing values (NA) to ensure dataset integrity. This step ensures that our subsequent analyses and models are built upon a robust and complete dataset, minimizing errors and potential biases.
linear_model_initial <- lm(price ~ ., data = train_df_linear)
df_results <- add_model_performance(
  model_name = "OLS_linear",
  model = linear_model_initial,
  train_df = train_df_linear,
  test_df = test_df_linear,
  target_var = "price",
  features = NULL,   # Replace NULL with a vector of feature names if needed
  df_results = NULL  # First call: there are no previous results to extend
)

linear_model_initial_2 <- lm(price ~ bathrooms + bedrooms + yr_renovated, data = train_df_linear)
df_results <- add_model_performance(
  model_name = "OLS_linear_2",
  model = linear_model_initial_2,
  train_df = train_df_linear,
  test_df = test_df_linear,
  target_var = "price",
  features = c("bathrooms", "bedrooms", "yr_renovated"),
  df_results = df_results  # Pass the previously created results
)
view_model_results(df_results)
sqft_living, waterfront1, and bedrooms have significant coefficients, implying a notable impact on housing prices.
Variables such as sqft_lot15 show less significance, suggesting a minor influence on price.
The coefficients for the grade dummies (grade_3 to grade_12) compared to a baseline grade (the omitted variable) are intriguing. At first glance they suggest that higher grades (implying better quality) are associated with lower prices. Note, however, that the omitted baseline is grade_13, the top of the scale, so each coefficient is measured relative to the highest grade. This warrants further investigation for data inconsistencies or other underlying factors.
print(vif(linear_model_initial))
## bedrooms bathrooms sqft_living
## 1.749979 3.525221 7.073569
## sqft_lot floors waterfront1
## 2.036098 2.543353 1.618081
## sqft_basement yr_built yr_renovated
## 2.235576 3.192618 1.192821
## sqft_living15 sqft_lot15 year_sold
## 3.402636 2.205563 3.644454
## day_sold seasonSpring seasonSummer
## 1.110920 4.375868 11.727556
## seasonFall day_of_year zipcode_98001
## 21.826740 18.736627 3.593959
## zipcode_98002 zipcode_98003 zipcode_98004
## 2.465027 3.152402 2.484314
## zipcode_98005 zipcode_98006 zipcode_98007
## 1.784149 2.975940 1.528981
## zipcode_98008 zipcode_98010 zipcode_98011
## 2.018367 1.805764 1.727180
## zipcode_98014 zipcode_98019 zipcode_98022
## 1.802351 1.787064 4.580666
## zipcode_98023 zipcode_98024 zipcode_98027
## 5.067811 1.423480 2.381506
## zipcode_98028 zipcode_98029 zipcode_98030
## 1.886239 2.129958 2.320167
## zipcode_98031 zipcode_98032 zipcode_98033
## 2.200350 1.629344 2.649313
## zipcode_98034 zipcode_98038 zipcode_98039
## 2.855103 4.653802 1.259242
## zipcode_98040 zipcode_98042 zipcode_98045
## 2.139123 4.109002 2.914920
## zipcode_98052 zipcode_98053 zipcode_98055
## 3.037300 2.444634 1.877824
## zipcode_98056 zipcode_98058 zipcode_98059
## 2.415166 2.641924 2.578546
## zipcode_98065 zipcode_98070 zipcode_98072
## 2.706643 1.757420 1.925572
## zipcode_98074 zipcode_98075 zipcode_98077
## 2.495220 2.373531 1.769284
## zipcode_98092 zipcode_98102 zipcode_98103
## 3.713169 1.368582 2.954804
## zipcode_98105 zipcode_98106 zipcode_98107
## 1.895194 2.110087 1.899568
## zipcode_98108 zipcode_98109 zipcode_98112
## 1.595029 1.410776 2.112834
## zipcode_98115 zipcode_98116 zipcode_98117
## 2.979952 2.092912 2.636110
## zipcode_98118 zipcode_98119 zipcode_98122
## 2.753277 1.590052 2.114685
## zipcode_98125 zipcode_98126 zipcode_98133
## 2.321204 2.156320 2.586399
## zipcode_98136 zipcode_98144 zipcode_98146
## 1.837647 2.305358 2.024811
## zipcode_98148 zipcode_98155 zipcode_98166
## 1.241295 2.423348 1.999003
## zipcode_98168 zipcode_98177 zipcode_98178
## 1.890186 1.954690 1.843229
## zipcode_98188 zipcode_98198 view_1
## 1.529140 2.273753 1.061534
## view_2 view_3 view_4
## 1.113534 1.124551 1.658200
## condition_2 condition_4 condition_5
## 1.041903 1.339192 1.259015
## grade_3 grade_4 grade_5
## 1.267163 3.296192 20.758889
## grade_6 grade_7 grade_8
## 166.843058 454.882839 371.925591
## grade_9 grade_10 grade_11
## 195.812087 88.269271 33.385350
## grade_12 distance_to_convergence
## 8.414754 20.045560
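To make the numbers above easier to read: for a single-coefficient term, the VIF of predictor j equals 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The following sketch computes this by hand on the built-in mtcars data; manual_vif is a hypothetical helper written for illustration, and the cutoff of 10 is a common rule of thumb, not a threshold stated in this report.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), with R_j^2 taken from
# regressing predictor j on the remaining predictors
manual_vif <- function(data, predictors) {
  sapply(predictors, function(target) {
    others <- setdiff(predictors, target)
    aux <- lm(reformulate(others, response = target), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# mtcars is used only as a stand-in dataset
predictors <- c("disp", "hp", "wt", "drat")
vifs <- manual_vif(mtcars, predictors)
print(round(vifs, 2))

# Predictors above an assumed rule-of-thumb cutoff of 10
print(names(vifs)[vifs > 10])
```

A predictor that is almost fully explained by the others has R_j^2 close to 1, which is why its VIF becomes very large.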
Explanation: In this section, we load the necessary libraries for modeling, including lmtest and car. We then use the lm() function to create the initial OLS regression model, where we predict the price based on all available features in the train_df_linear dataset. The summary() function provides detailed information about the model, including coefficients, significance levels, and goodness-of-fit statistics. In the VIF output, the grade dummies (up to about 455 for grade_7), seasonFall, day_of_year, and distance_to_convergence stand out with values far above the rest, pointing to multicollinearity among those terms.
# Feature Evaluation
After constructing the initial model, we proceed with feature evaluation. This step involves investigating the importance of each predictor in the model. We assess the significance of individual variables and consider whether to remove or combine categorical variables to optimize model performance. The R code below first measures the initial model's performance on the test set, which serves as the reference point for any feature changes:
predictions <- predict(linear_model_initial, newdata = test_df_linear)
# Assuming 'actual_values' is the actual target variable values in your test dataset
actual_values <- test_df_linear$price
# Calculate Mean Absolute Error (MAE)
mae <- mean(abs(predictions - actual_values))
# Calculate Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((predictions - actual_values)^2))
# Calculate R-squared (R2)
residuals <- actual_values - predictions
ss_total <- sum((actual_values - mean(actual_values))^2)
ss_residual <- sum(residuals^2)
r2 <- 1 - (ss_residual / ss_total)
# Print the evaluation metrics
cat("MAE:", mae, "\n")
## MAE: 88389.96
cat("RMSE:", rmse, "\n")
## RMSE: 150221.9
cat("R-squared:", r2, "\n")
## R-squared: 0.8311833
Explanation: In this section, we perform feature evaluation using stepwise selection. This technique iteratively adds or removes predictors to arrive at a compact set of relevant variables for the model. By default, the step() function compares candidate models by the Akaike Information Criterion (AIC) rather than by individual significance levels. The resulting stepwise_model provides insights into the final set of predictors and their impact on model performance.
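Since the stepwise call itself is not shown, here is a minimal sketch of what such a step could look like. It uses the built-in mtcars data as a stand-in for train_df_linear, and the arguments shown (direction = "both", trace = 0) are assumptions rather than the settings used in this project.

```r
# mtcars stands in for the training data; mpg stands in for price
full_model <- lm(mpg ~ ., data = mtcars)

# Start from the full model and let step() drop or re-add terms by AIC
stepwise_model <- step(full_model, direction = "both", trace = 0)

# Inspect which predictors survived
print(formula(stepwise_model))

# step() only accepts a move when it lowers the AIC, so the result
# cannot score worse than the starting model
print(extractAIC(stepwise_model)[2] <= extractAIC(full_model)[2])
```

With roughly a hundred dummy columns in the real training set, a run like this can take noticeably longer, so it may be worth limiting the candidate terms through the scope argument.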
Copyright © 2023 Charles and with AI assistance. Fabricated with blood, sweat, and more coffee than a pod of overcaffeinated narwhals. Unauthorized replication of this work will result in a jester performing terribly lame jokes at your next family gathering. Proceed with caution.